VILA: Improving Structured Content Extraction from Scientific PDFs Using Visual Layout Groups

Abstract

Accurately extracting structured content from PDFs is a critical first step for NLP over scientific papers. Recent work has improved extraction accuracy by incorporating elementary layout information, for example, each token’s 2D position on the page, into language model pretraining. We introduce new methods that explicitly model VIsual LAyout (VILA) groups, that is, text lines or text blocks, to further improve performance. In our I-VILA approach, we show that simply inserting special tokens denoting layout group boundaries into model inputs can lead to a 1.9% Macro F1 improvement in token classification. In the H-VILA approach, we show that hierarchical encoding of layout groups can result in up to 47% inference time reduction with less than 0.8% Macro F1 loss. Unlike prior layout-aware approaches, our methods do not require expensive additional pretraining, only fine-tuning, which we show can reduce training cost by up to 95%. Experiments are conducted on a newly curated evaluation suite, S2-VLUE, that unifies existing automatically labeled datasets and includes a new dataset of manual annotations covering diverse papers from 19 scientific disciplines. Pre-trained weights, benchmark datasets, and source code are available at https://github.com/allenai/VILA.
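The I-VILA idea of marking layout group boundaries in the model input can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `insert_group_boundaries`, the marker string `[BLK]`, and the placeholder label `-100` for inserted markers are all assumptions made for illustration. The sketch assumes the tokens have already been grouped into text lines or blocks, and it places one marker between consecutive groups.

```python
from typing import List, Sequence, Tuple

# Hypothetical marker string; the token actually used by a model may differ.
BOUNDARY_TOKEN = "[BLK]"
# Assumed placeholder label so inserted markers can be skipped when scoring.
IGNORE_LABEL = -100


def insert_group_boundaries(
    groups: Sequence[Sequence[str]],
    group_labels: Sequence[Sequence[int]],
) -> Tuple[List[str], List[int]]:
    """Flatten layout groups into one sequence, with a marker between groups."""
    if len(groups) != len(group_labels):
        raise ValueError("groups and group_labels must have the same length")

    tokens: List[str] = []
    labels: List[int] = []
    for index, (words, word_labels) in enumerate(zip(groups, group_labels)):
        if len(words) != len(word_labels):
            raise ValueError("each group needs exactly one label per token")
        if index > 0:
            # A boundary marker separates this group from the previous one.
            tokens.append(BOUNDARY_TOKEN)
            labels.append(IGNORE_LABEL)
        tokens.extend(words)
        labels.extend(word_labels)
    return tokens, labels


if __name__ == "__main__":
    # Three toy text blocks: a title, a section header, and body text.
    blocks = [["VILA", ":"], ["Abstract"], ["Accurately", "extracting"]]
    block_labels = [[0, 0], [1], [2, 2]]
    flat_tokens, flat_labels = insert_group_boundaries(blocks, block_labels)
    print(flat_tokens)
    print(flat_labels)
```

The flattened sequence can then be passed to any token classifier during fine-tuning. Giving the markers a placeholder label is one way to keep them out of the loss and the evaluation metrics.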

Similar resources

Layout Group Extraction from Web Content for Effective Adaptation

These days, people access the Web by using various devices and methods, such as PDAs, cellular phones, and voice-based browsers. However, most Web content is designed for desktop computers. Therefore, already existing Web content should be transcoded to be suitable for each access device and method. For this purpose, some annotation-based transcoding systems have been developed. An annotation is...

Layout and Content Extraction for PDF Documents

Portable document format (PDF) is a common output format for electronic documents. Most PDF documents are untagged and do not have basic high-level document logical structural information, which makes the reuse or modification of the documents difficult. We developed techniques that identified logical components on a PDF document page. The outlines, style attributes and the contents of the logi...

Structured-Content Extraction from the Web for Bibliographic Reference Generation

In this paper we present a system that automatically creates bibliographic indexes from a collection of PDF files by using the file contents to search the Web and later extract the information from the resulting pages. We pay special attention to the techniques used for extracting this data as well as the automatic generation of extraction rules and their evaluation.

Effective Metadata Extraction from Irregularly Structured Web Content

Human layout estimation using structured output learning

In this thesis, we investigate the problem of human layout estimation in unconstrained still images. This involves predicting the spatial configuration of body parts. We start our investigation with pictorial structure models and propose an efficient method of model fitting using skin regions. To detect the skin, we learn a colour model locally from the image by detecting the facial region. The...

Journal

Journal title: Transactions of the Association for Computational Linguistics

Year: 2022

ISSN: 2307-387X

DOI: https://doi.org/10.1162/tacl_a_00466